Here we continue from on from the Bioinformatics element of the workflow in Part 1. We load in the phyloseq object we made in Part 1, which contains our sequences (ASVs), their taxonomic assignments, their phylogenetic relationships (tree), and the sample metadata that decodes which individual is from which population, as well as individual level trait data like mass etc.
Packages we’ll need for all the stats
library(phyloseq) #microbiome data handling
library(ggplot2) #plotting
library(dplyr) # data handling
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(vegan) #plotting of community data like bacterial microbiomes and statistical models
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(decontam) #identifying contaminants in our sequencing workflow
library(microbiome) # microbiome plotting and convenience functions
##
## microbiome R package (microbiome.github.com)
##
##
##
## Copyright (C) 2011-2022 Leo Lahti,
## Sudarshan Shetty et al. <microbiome.github.io>
##
## Attaching package: 'microbiome'
## The following object is masked from 'package:vegan':
##
## diversity
## The following object is masked from 'package:ggplot2':
##
## alpha
## The following object is masked from 'package:base':
##
## transform
library(ggordiplots) # tidyverse versions of the vegan plots
## Loading required package: glue
library(MASS) #generalized linear models of pathogen infection data (Negative Binomial Error Structure)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(pheatmap) #pretty heatmaps
library(RColorBrewer) #Add Custom Colours (e.g. Colour Blind Friendy)
library(MuMIn)
Quick access options for output graphics to make legends and axis labels larger / more legible
#Global Plot Options
plotopts<- theme(axis.text=element_text(size=20),axis.title=element_text(size=22),strip.text=element_text(size=22),legend.title = element_text(size=20),legend.text = element_text(size=20))
Load in Our Phyloseq Object
load('Frog_Spatial_Phyloseq.RData')
Note the unique way of accessing the different elements of the phyloseq object. The three commands are:
Sample Data: sample_data(ps)
Taxonomy: tax_table(ps)
ASV Matrix otu_table(ps)
Note how even though we’re using ASVs, phyloseq still refers to them as OTUs in the abundance matrix.
head(otu_table(ps))
## OTU Table: [6 taxa and 31 samples]
## taxa are rows
## Bacterial_2 Nc2 P1F1 P1F10 P1F11 P1F12 P1F13
## 355f1aba5e04e808233fac8805dd1594 0 0 0 0 0 0 0
## edd9e404caa53425e090356336d8be6f 0 0 0 0 0 2 0
## 998ca566e4074ce8ea844712d8bd6cd0 0 0 0 0 0 0 0
## def84c0be5db839bfd66a4b0baa519d9 0 0 0 0 0 0 0
## b1c1860c3a6c5a046a5854e2a939069b 0 0 0 5 0 0 0
## b69b59edc63b83ff1a63821bed095796 0 0 0 0 0 37 0
## P1F14 P1F15 P1F16 P1F2 P1F3 P1F4 P1F5 P1F6
## 355f1aba5e04e808233fac8805dd1594 0 0 0 0 0 0 0 0
## edd9e404caa53425e090356336d8be6f 0 0 0 0 0 0 0 0
## 998ca566e4074ce8ea844712d8bd6cd0 0 6 0 0 0 0 0 0
## def84c0be5db839bfd66a4b0baa519d9 0 0 0 0 0 0 0 0
## b1c1860c3a6c5a046a5854e2a939069b 0 0 0 0 0 0 0 0
## b69b59edc63b83ff1a63821bed095796 0 0 0 0 0 0 0 0
## P1F7 P1F8 P1F9 P4F1 P4F10 P4F11 P4F12 P4F17
## 355f1aba5e04e808233fac8805dd1594 0 0 0 0 0 0 0 0
## edd9e404caa53425e090356336d8be6f 0 0 0 0 0 0 0 0
## 998ca566e4074ce8ea844712d8bd6cd0 0 0 0 0 0 0 0 0
## def84c0be5db839bfd66a4b0baa519d9 0 0 0 0 0 0 0 0
## b1c1860c3a6c5a046a5854e2a939069b 0 0 0 0 0 0 0 0
## b69b59edc63b83ff1a63821bed095796 0 0 0 0 0 0 0 0
## P4F19 P4F2 P4F3 P4F5 P4F6 P4F7 P4F8 P4F9
## 355f1aba5e04e808233fac8805dd1594 0 2 0 0 0 0 0 0
## edd9e404caa53425e090356336d8be6f 0 0 0 0 0 0 0 0
## 998ca566e4074ce8ea844712d8bd6cd0 0 0 0 0 0 0 0 0
## def84c0be5db839bfd66a4b0baa519d9 0 0 6 0 0 0 0 0
## b1c1860c3a6c5a046a5854e2a939069b 0 0 0 0 0 0 0 0
## b69b59edc63b83ff1a63821bed095796 0 0 0 0 0 0 0 0
Here we do some basic cleaning of the data - removing ASVs with no Phylum assignment (probably junk), remove any Chloroplasts (sneaky Cyanobacteria) that have made it through the pipeline, and then any Archaea that also amplify at the 16S locus we chose.
#Prune Taxa With No Phylum Assignment
ps_prune<-prune_taxa(as.vector(!is.na(tax_table(ps)[,2])),ps)
ntaxa(ps)-ntaxa(ps_prune) #82 lost
## [1] 82
#Prune Chloroplasts
ps_prune_nochloro<-prune_taxa(as.vector(tax_table(ps_prune)[,3]!="Chloroplast"),ps_prune)
ntaxa(ps_prune)-ntaxa(ps_prune_nochloro) #266 lost
## [1] 157
#How Many Archaea?
sum(as.vector(tax_table(ps_prune_nochloro)[,1]=="Archaea"))
## [1] 43
#Remove Archaea
ps_prune_nochloro_noarchaea<-prune_taxa(as.vector(tax_table(ps_prune_nochloro)[,1]!="Archaea"),ps_prune_nochloro)
ntaxa(ps_prune_nochloro_noarchaea)-ntaxa(ps_prune_nochloro)
## [1] -43
It’s worth noting here that how to deal with negative contamination is a contentious issue. One strategy is to simply remove any sequence that popped up in the negative controls from ALL samples. This is intuitive and has the advantage of being objective, but sometimes you can pick up extremely low abundance sequences that are highly abundant in other samples - suggesting some nefarious aerosol of DNA contaminated the negative from a real sample. Noah Fierer and co. have written a blog on sources of contamination and how to deal with them
Here we’ll use the R package decontam to identify and remove negative contaminants based on their prevalence and abundance in negative controls versus ‘real’ samples. First we need a new variale that flags which sample(s) are negatives. It is not uncommon to have multiple negatives in these workflows - one for each extraction plate, PCR plate etc.
## Flag Negatives
sample_data(ps_prune_nochloro_noarchaea)$is_negative<- sample_data(ps_prune_nochloro)$SAMPLE.TYPE=="NegativeControl"
##Inspect Library SIzes
df <- as.data.frame(sample_data(ps_prune_nochloro_noarchaea)) # Put sample_data into a ggplot-friendly data.frame
df$LibrarySize <- sample_sums(ps_prune_nochloro_noarchaea)
df <- df[order(df$LibrarySize),]
df$Index <- seq(nrow(df))
ggplot(data=df, aes(x=Index, y=LibrarySize, color=is_negative)) + geom_point()
Now we identify contaminants based on their prevalence. The key variable here is “threshold which will define how strict our identification is. From the help file:
THRESHOLD: The probability threshold below which (strictly less than) the null-hypothesis (not a contaminant) should be rejected in favor of the alternate hypothesis (contaminant).
#Negatives Based on Prevalence // Threshold 0.5
contamdf.prev <- isContaminant(ps_prune_nochloro_noarchaea, method="prevalence", neg="is_negative",threshold=0.5)
table(contamdf.prev$contaminant)
##
## FALSE TRUE
## 5643 8
#Plot
ps.pa <- transform_sample_counts(ps_prune_nochloro_noarchaea, function(abund) 1*(abund>0))
ps.pa.neg <- prune_samples(sample_data(ps.pa)$is_negative == TRUE, ps.pa)
ps.pa.pos <- prune_samples(sample_data(ps.pa)$is_negative == FALSE, ps.pa)
# Make data.frame of prevalence in positive and negative samples
df.pa <- data.frame(pa.pos=taxa_sums(ps.pa.pos), pa.neg=taxa_sums(ps.pa.neg),
contaminant=contamdf.prev$contaminant)
ggplot(data=df.pa, aes(x=pa.neg, y=pa.pos, color=contaminant)) + geom_point() +
xlab("Prevalence (Negative Controls)") + ylab("Prevalence (True Samples)")
####### PRUNE OUT
ps_clean<-prune_taxa(contamdf.prev$contaminant==FALSE,ps_prune_nochloro_noarchaea)
######## FILTER OUT POSITIVES
#ps_clean_positives<-prune_samples(sample_data(ps_clean)$SAMPLE.TYPE=="Positive_Control",ps_clean)
######### SUBSET TO ONLY FROGS
ps_clean_samples<-prune_samples(sample_data(ps_clean)$SAMPLE.TYPE %in% c("FROG"),ps_clean)
ps_clean_samples #check what we're left with
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5643 taxa and 29 samples ]
## sample_data() Sample Data: [ 29 samples by 11 sample variables ]
## tax_table() Taxonomy Table: [ 5643 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 5643 tips and 5642 internal nodes ]
## refseq() DNAStringSet: [ 5643 reference sequences ]
Now that we’ve done that, we can check what our post-QC library sizes are. It’s a good idea to report these in manuscripts and thesis chapters. Here was ask what the minimum, maximum and mean library sizes-per-samples are.
############## POST QC LIBRARY SIZES
###############
# Post-QC Library Stats
###############
mean(sample_sums(ps_clean_samples))
## [1] 17352.55
range(sample_sums(ps_clean_samples))
## [1] 12136 21045
We might be interested in checking that our per-group coverage is roughly equal. Lack of equal coverage might occur if a partocular subset of samples amplify poorly and are hard to pool at the same concentration as other samples. They might be older samples, or from a different sampling location, that systematically exhibit lower coverage. We’ve had reviewers ask for these checks before, so it might be wise to do them as a matter of course.
#Make a data frame of read depths
frog_reads<-data.frame(reads=sample_sums(ps_clean_samples))
#Add on the sample ID
frog_reads$sample<-rownames(frog_reads)
#Join on the Metadata
frog_meta<-as(sample_data(ps_clean_samples),"data.frame")
frog_reads<-left_join(frog_reads,frog_meta,"sample")
#Some Boxplots of Coverage by Population using ggplot2
ggplot(frog_reads,aes(x=SITE,y=reads)) + geom_boxplot()
Rarefaction curves allow us to assess whether we’ve likely discovered all the microbial ‘species’ / ASVs in a sample. The lower the true species diversity, and the higher the sampling depth (number of reads), the more likely we are to have saturated our species discovery curves. We can plot per-sample rates of species discovery using the ‘rarecurve’ function in vegan. By default this will plot a different line for each sample, so if you have hundreds of samples, these graphs can look very cluttered!
#Strip Out the OTU Table
ps_clean_otutable<-as(t(otu_table(ps_clean_samples)),"matrix")
#Rarefaction Curves
rarecurve(ps_clean_otutable,step=50)
## Warning in rarecurve(ps_clean_otutable, step = 50): most observed count data
## have counts 1, but smallest count is 2
Ok - now we’re ready to do some calculations. But first, we need to remove any artefacts caused by uneven sampling depth across samples, using the command rarefy_even_depth. The rarefaction curves above should convince us where we can subsample data to (in 000s of reads) without losing information about community composition.
This command will automatically rarefy to the lowest-coverage sample sampling depth if you do not specify a sampling depth. This means you won’t lose any samples, but could potentially throw away lots of data if there’s a single library that amplified poorly.
What is super important is that you specify a random number seed. This is because the rarefying step is based on a random subsample of each library, and providing a random number seed means your analysis will be completely reproducible (rather than differing every tme because of sampling error)
####### Rarefying Samples
#remind ourselves what the minimum coverage ius
min(sample_sums(ps_clean_samples))
## [1] 12136
# Rarefy to lowest library size, set random number see for reproducibility
ps_rare<-rarefy_even_depth(ps_clean_samples,rngseed = 150517)
## `set.seed(150517)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(150517); .Random.seed` for the full vector
## ...
## 204OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
We can rarefy to custom depths as well, but will get warnings if we set the value too high and lose samples with lower depth of coverage
# Rarefy to custom depth
ps_rare_custom<-rarefy_even_depth(ps_clean_samples,sample.size=15000,rngseed = 150517)
## `set.seed(150517)` was used to initialize repeatable random subsampling.
## Please record this for your records so others can reproduce.
## Try `set.seed(150517); .Random.seed` for the full vector
## ...
## 5 samples removedbecause they contained fewer reads than `sample.size`.
## Up to first five removed samples are:
## P1F3P1F8P4F1P4F11P4F6
## ...
## 811OTUs were removed because they are no longer
## present in any sample after random subsampling
## ...
Note how we get helpful warnings about which samples we lost, and also which sequences we lost because they were found only in those samples. Again this demonstrates how phyloseq operates on the whole phyloseq object and how all the tables within an object are linked.
Now we want to now the Richness/Diversity of our samples. phyloseq has many built in metrics, all of which differ in the assumptions they make about the data & what they are measuring e.g. pure richness (Observed), or richness weighted by evenness (Shannon)
It’s important to note that these metrics will all be correlated because they all measure similar things. Some papers will report tests on multiple alpha diversity values, but this arguably exposes them to multiple testing issues - that is increased risk of finding a false positive by asking the same statistical question over and over.
#Estimate Observed RIchness and Shannon Diversity
frog_rich<-estimate_richness(ps_rare,measures=c('Shannon','Observed'))
#Add On Sample metadata
frog_rich$sample<-rownames(frog_rich)
frog_rich<-left_join(frog_rich,frog_meta,"sample")
site_richness1<-ggplot(frog_rich,aes(y=Observed,x=SITE)) + geom_violin(aes(fill=SITE)) + plotopts
site_richness1 + guides(fill="none")
rich_pathogen_plot1<-ggplot(frog_rich,aes(x=Observed,y=pathogen_GE,group=SITE)) + geom_point(size=5,shape=21,aes(fill=SITE)) + geom_smooth(method="lm")
rich_pathogen_plot1 + labs(y="Bd Infection (Genomic Equivalents)",x="Observed Bacterial Diversity") + plotopts
## `geom_smooth()` using formula 'y ~ x'
#Model (uses the MASS library)
m1<-glm.nb(pathogen_GE ~ Observed*SITE ,data=frog_rich)
summary(m1)
##
## Call:
## glm.nb(formula = pathogen_GE ~ Observed * SITE, data = frog_rich,
## init.theta = 3.090627629, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93174 -0.83770 -0.09583 0.37863 1.58661
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.734001 0.734779 11.887 < 2e-16 ***
## Observed -0.003771 0.001433 -2.631 0.00850 **
## SITETugela -0.443454 0.832155 -0.533 0.59410
## Observed:SITETugela -0.005297 0.001821 -2.909 0.00362 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.0906) family taken to be 1)
##
## Null deviance: 117.22 on 28 degrees of freedom
## Residual deviance: 30.51 on 25 degrees of freedom
## AIC: 401.78
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.091
## Std. Err.: 0.787
##
## 2 x log-likelihood: -391.785
#Model Selection using AIC
m1_nointeraction<-update(m1,~.-Observed:SITE)
m1_obs<-glm.nb(pathogen_GE ~ Observed,data=frog_rich)
m1_site<-glm.nb(pathogen_GE ~ SITE ,data=frog_rich)
m1_null<-glm.nb(pathogen_GE ~ 1,data=frog_rich)
#Build The AICc table
model.sel(m1,m1_nointeraction,m1_obs,m1_site,m1_null)
## Model selection table
## (Int) Obs SIT Obs:SIT family init.theta df
## m1 8.734 -0.003771 + + NB(3.0906,l) 3.09 5
## m1_nointeraction 10.520 -0.007216 + NB(2.4907,l) 2.49 4
## m1_site 6.927 + NB(1.0632,l) 1.06 3
## m1_null 6.499 NB(0.8852,l) 0.885 2
## m1_obs 6.923 -0.001084 NB(0.9026,l) 0.903 3
## logLik AICc delta weight
## m1 -195.892 404.4 0.00 0.875
## m1_nointeraction -199.313 408.3 3.90 0.125
## m1_site -213.925 434.8 30.42 0.000
## m1_null -217.346 439.2 34.76 0.000
## m1_obs -216.978 440.9 36.52 0.000
## Abbreviations:
## family: NB(0.8852,l) = 'Negative Binomial(0.8852,log)',
## NB(0.9026,l) = 'Negative Binomial(0.9026,log)',
## NB(1.0632,l) = 'Negative Binomial(1.0632,log)',
## NB(2.4907,l) = 'Negative Binomial(2.4907,log)',
## NB(3.0906,l) = 'Negative Binomial(3.0906,log)'
## Models ranked by AICc(x)
A plea to never use BOTH AIC and p values. This is just to show how our inference may or may not change depending on your statistical worldview.
#A Likelihood Ratio Test of the Interaction
anova(m1,m1_nointeraction)
## Likelihood ratio tests of Negative Binomial Models
##
## Response: pathogen_GE
## Model theta Resid. df 2 x log-lik. Test df LR stat.
## 1 Observed + SITE 2.490703 26 -398.6252
## 2 Observed * SITE 3.090628 25 -391.7850 1 vs 2 1 6.840241
## Pr(Chi)
## 1
## 2 0.008912683
# Model - Is Mortality a Function of Pathogen Load
mortality_model1<-glm(mortality ~ pathogen_GE,family=binomial,data=frog_rich)
summary(mortality_model1)
##
## Call:
## glm(formula = mortality ~ pathogen_GE, family = binomial, data = frog_rich)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.21627 -0.29404 -0.18093 0.08171 2.17499
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.367710 1.596223 -2.736 0.00621 **
## pathogen_GE 0.006934 0.002903 2.389 0.01691 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 39.336 on 28 degrees of freedom
## Residual deviance: 12.514 on 27 degrees of freedom
## AIC: 16.514
##
## Number of Fisher Scoring iterations: 7
We can summarise our microbial communities using compositional barplots, which provide a quick and accessible way of visualising differences in taxonomy at different hierarchies.
The first step here is to remove samples in our phyloseq object with low reads, otherwise this will break our averaging
#Filter Samples With No Data
physeq_subset<-prune_samples(sample_sums(ps_rare)>0,ps_rare)
Next we tell phyloseq to aggregate reads at the taxonomic level of interest - here Phylum - and to transform the data to relative abundance (fraction of reads, rather than count of reads)
#Aggregate To Phylum Level and Transform to Relative Abundance
physeq_phylum <- physeq_subset %>%
aggregate_taxa(level = "Phylum") %>%
microbiome::transform(transform = "compositional")
Now we have some choices. We can plot the data with a bar for each sample:
#Plot Composition By Sample ID
physeq_sample_plot <- physeq_phylum %>%
plot_composition(sample.sort = "Proteobacteria")
physeq_sample_plot
Or Average By metadata of interest - perhaps we just want to look at average structure by SITE ID?
#Plot Composition Grouped By Sample Metadata
physeq_grouped_plot <- physeq_phylum %>%
plot_composition(sample.sort = "Proteobacteria", average_by = "SITE")
physeq_grouped_plot
The plots can get quite messy - so it’s not uncommon to just subset to the top ‘N’ of a taxonomic group for easier visualisation
#What Are the Names of the most abundant phyla?
physeq_phylumcollapse<- physeq_subset %>% aggregate_taxa(level="Phylum")
physeq_top5phyla = names(sort(taxa_sums(physeq_phylumcollapse), TRUE)[1:5])
physeq_top5phyla
## [1] "Proteobacteria" "Bacteroidota" "Cyanobacteria" "Actinobacteriota"
## [5] "Firmicutes"
#Subset the phyloseq object to those phyla
physeq_top5phylum_filter<-subset_taxa(physeq_subset,Phylum %in% physeq_top5phyla)
#Remake Our Graph but with no grouping (samples)
physeq_top5phylum_samples_plot <- physeq_top5phylum_filter %>%
aggregate_taxa(level = "Phylum") %>%
microbiome::transform(transform = "compositional") %>%
plot_composition(sample.sort = "Proteobacteria")
physeq_top5phylum_samples_plot
#Remake Our Graph but with averaging by SITE
physeq_top5phylum_site_plot <- physeq_top5phylum_filter %>%
aggregate_taxa(level = "Phylum") %>%
microbiome::transform(transform = "compositional") %>%
plot_composition(sample.sort = "Proteobacteria", average_by = "SITE")
physeq_top5phylum_site_plot
### Add Custom Colours (e.g. Colour Blind Friendy)
physeq_top5phylum_site_plot + scale_fill_brewer(palette="Set2")
#All The CBF palettes in RColorBrewer
display.brewer.all(colorblindFriendly = T)
##### HEATMAP (baked into phyloseq)
#Subset to Most 50 abundant OTUs
ps_rare_top50 <- prune_taxa(names(sort(taxa_sums(ps_rare),TRUE)[1:50]), ps_rare)
#Plot Heatmap with X axis ordered Inter-Sample Distance
plot_heatmap(ps_rare_top50,"NMDS",distance = "bray")
## Warning: Transformation introduced infinite values in discrete y-axis
#And again, with explicit ordering of samples by SITE
plot_heatmap(ps_rare_top50,"NMDS",distance = "bray",sample.label="SITE",sample.order = "SITE")
## Warning: Transformation introduced infinite values in discrete y-axis
There are plenty of heatmap packages in R, and one I quite like is the pheatmap package. It allows you to annotate the heatmaps with useful metadata, like sampling site etc.
Here we’ll plot a heatmap with pheatmap, and annotate samples based on SITE ID for that sample. Note we’re not clustering by column (sample) here, just letting them be plotted in site-order as they appear in the data frame.
#Generate Metadata for Plotting Colours (SITE ID)
site_data<-data.frame(Site=sample_data(ps_rare_top50)$SITE)
rownames(site_data)<-rownames(sample_data(ps_rare_top50))
#Strip Out OTU_Table
ps_rare_top50_otu<-otu_table(ps_rare_top50)
#Plot Heatmap - No Explicit Clustering
pheatmap(ps_rare_top50_otu,cluster_cols = FALSE,scale="row",annotation_col = site_data)
Now cluster columns by similarity
#As Above, But Order Columns (Samples) by Similarity
pheatmap(ps_rare_top50_otu,cluster_cols = TRUE,scale="row",annotation_col = site_data)
Now force ordering by SITE.
#And Again, but for Site Ordering
ps_rare_top50_otu_siteorder<-ps_rare_top50_otu[,order(site_data$Site)]
pheatmap(ps_rare_top50_otu_siteorder,cluster_cols = FALSE,scale="row",annotation_col = site_data)
It can be useful to visualize microbial community structure rather than rely on a matrix of numbers. There are many ways to do so, all with jazzy acronyms, including Non-Metric Multidimensional Scaling (NMDS), Principal Coordinates Analysis (PCoA), Constrained Correspondence Analysis (CCA), Detrended Correspondence Analysis (DCA) etc. They each make different assumptions about the data, but they share the same principle of reducing variation in the relative abundances hundreds of microbial taxa into 2 or 3 axes that can then be plotted/visualised, or statistically tested.
For an excellent guide to the pros and cons of all of these methods, see this paper by Paliy & Shankar
Here we’ll use the built in phyloseq functions to perform NMDS ordination on Bray-Curtis distances among samples. We specify ‘k=2’ to state that the variation should be condensed into 2 axes.
As a general rule, the ‘stress’ value of an NMDS ordination
should be below 15% / 0.15, and ideally below 10%. Stress is a measure
of how ‘easy’ it was to reduce the variation in your microbial abundance
matrix into the number of axes you specify. If the stress value is too
large, or the model doesn’t converge, then try setting k to 3. If you do
this, make sure to state in your methods of your paper that the
ordination was done across 3 axes.
A great tutorial on NMDS and Stress here
####### NMDS ORDINATION
twosite_ord_NMDS <- phyloseq::ordinate(ps_rare, method = "NMDS",distance="bray")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1147353
## Run 1 stress 0.1297722
## Run 2 stress 0.1147353
## ... Procrustes: rmse 2.252791e-06 max resid 6.612889e-06
## ... Similar to previous best
## Run 3 stress 0.1147353
## ... Procrustes: rmse 1.600478e-06 max resid 4.583389e-06
## ... Similar to previous best
## Run 4 stress 0.1147353
## ... Procrustes: rmse 2.356444e-06 max resid 4.738009e-06
## ... Similar to previous best
## Run 5 stress 0.1147353
## ... Procrustes: rmse 1.076378e-06 max resid 2.764694e-06
## ... Similar to previous best
## Run 6 stress 0.1897878
## Run 7 stress 0.1147353
## ... Procrustes: rmse 7.756061e-07 max resid 1.988763e-06
## ... Similar to previous best
## Run 8 stress 0.1147353
## ... Procrustes: rmse 2.345924e-06 max resid 4.272649e-06
## ... Similar to previous best
## Run 9 stress 0.1147353
## ... Procrustes: rmse 9.748908e-07 max resid 1.649597e-06
## ... Similar to previous best
## Run 10 stress 0.1315542
## Run 11 stress 0.1147353
## ... Procrustes: rmse 2.287383e-06 max resid 6.593762e-06
## ... Similar to previous best
## Run 12 stress 0.1147353
## ... Procrustes: rmse 2.571117e-06 max resid 9.029284e-06
## ... Similar to previous best
## Run 13 stress 0.1989472
## Run 14 stress 0.1147353
## ... Procrustes: rmse 1.473792e-06 max resid 4.245585e-06
## ... Similar to previous best
## Run 15 stress 0.1147353
## ... Procrustes: rmse 1.016549e-06 max resid 2.514926e-06
## ... Similar to previous best
## Run 16 stress 0.1147353
## ... Procrustes: rmse 2.840351e-06 max resid 1.071076e-05
## ... Similar to previous best
## Run 17 stress 0.1147353
## ... Procrustes: rmse 1.606471e-06 max resid 4.357874e-06
## ... Similar to previous best
## Run 18 stress 0.1147353
## ... New best solution
## ... Procrustes: rmse 6.285771e-07 max resid 1.277707e-06
## ... Similar to previous best
## Run 19 stress 0.1147353
## ... Procrustes: rmse 8.847204e-07 max resid 2.070173e-06
## ... Similar to previous best
## Run 20 stress 0.1344681
## *** Best solution repeated 2 times
phyloseq::plot_ordination(ps_rare, twosite_ord_NMDS, type="samples", color="SITE") + geom_point(size=5)
gg_ordiplot(twosite_ord_NMDS,sample_data(ps_rare)$SITE,spiders = TRUE,ellipse = T)
An alternative to Bray-Curtis distances among samples is to use the Unifrac distance. Unifrac takes into account the phylogenetic relatedness amongst community members using a sequence tree built from your sequences. ‘Unweighted’ unifrac simply uses genetic distance, whereas ‘Weighted’ unifrac takes into account the relative abundance of each microbial taxon using the abundance table
####### UNWEIGHTED UNIFRAC ORDINATION
#Calculate UW UNIFRAC
twosite_ord_unifrac_uw <- phyloseq::ordinate(ps_rare, method = "PCoA",distance="unifrac",weighted=FALSE)
#Base Phyloseq Plot
#phyloseq::plot_ordination(ps_rare, twosite_ord_unifrac_uw, type="samples", color="SITE")+ geom_point(size=5)
#Prettier Plot (note different syntax)
gg_ordiplot(twosite_ord_unifrac_uw$vectors,groups=sample_data(ps_rare)$SITE,spider=T)
####### WEIGHTED UNIFRAC ORDINATION
#Calculate W UNIFRAC
twosite_ord_unifrac_w <- phyloseq::ordinate(ps_rare, method = "PCoA",distance="unifrac",weighted=TRUE)
#Base Phyloseq Plot
#phyloseq::plot_ordination(ps_rare, twosite_ord_unifrac_w, type="samples", color="SITE")+ geom_point(size=5)
#Prettier Plot (note different syntax)
gg_ordiplot(twosite_ord_unifrac_w$vectors,groups=sample_data(ps_rare)$SITE,spider=T)
#CLR TRANSFORM
ps_clr<-microbiome::transform(ps_clean_samples, 'clr')
#Ordinate
ord_clr <- phyloseq::ordinate(ps_clr, "RDA")
#Plot with Phyloseq
phyloseq::plot_ordination(ps_clr, ord_clr, type="samples", color="SITE")
#GG Version
gg_ordiplot(ord_clr,sample_data(ps_clr)$SITE,spiders = TRUE,ellipse = T)
PERMANOVA is a randomisation procedure that will test for the effects of predictors of interest in driving differences in beta diversity / community structure. It runs in vegan, so we need the files we converted for vegan. ‘soil.v’ is our ASV abundance matrix, and the ‘data’ argument needs our sample metadata.
The nice thing about PERMANOVA is that it provides r2 values for effects as well as p values, so you can get an idea of % of explained variance and ‘importance’ of effects.
We can only use categorical predictors in PERMANOVA, so instead of infection load, lets see if microbial community structure varies based on whether individuals died from their Bd infections.
## 3.4 Statistical Testing Using PERMANOVA
#Function to Extract the Data from phyloseq in a format vegan can understand
vegan_otu <- function(physeq) {
OTU <- otu_table(physeq)
if (taxa_are_rows(OTU)) {
OTU <- t(OTU)
}
return(as(OTU, "matrix"))
}
#Convert OTU table to abundance matrix
frog.v<-vegan_otu(ps_rare)
#Convert Sample Data to
frog.s<-as(sample_data(ps_rare),"data.frame")
######################### TESTING FOR DIFFERENCES WITH PERMANOVA
frog.adonis<-adonis2(frog.v ~ SITE + mortality ,data=frog.s,permutations=10000,method="bray")
frog.adonis
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 10000
##
## adonis2(formula = frog.v ~ SITE + mortality, data = frog.s, permutations = 10000, method = "bray")
## Df SumOfSqs R2 F Pr(>F)
## SITE 1 1.7681 0.19309 6.5264 9.999e-05 ***
## mortality 1 0.3450 0.03768 1.2735 0.2157
## Residual 26 7.0437 0.76923
## Total 28 9.1568 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Network analysis allows us to visualise co-occurrence of microbial taxa, or similarity of samples based on microbial community profiles.
Samples closer together in the network have more similar community structures - i.e. share a higher number of common ASVs. We are using Jaccard distance here, which only accounts for presence/absence, not abundances of ASVs. Values near 1 mean very DISSIMILAR, whereas values near 0 mean very similar (small distance between them). We can make our cutoff more stringent by changing the ‘max.dist’ variable.
################# NETWORKS
ig <- make_network(ps_rare, "samples", distance = "jaccard", max.dist = 0.95)
plot_network(ig, ps_rare, type="samples", point_size = 5, label=NULL, color="SITE", line_alpha = 0.05)